Curriculum learning and self-paced learning are training strategies that gradually expose samples from easy to more complex. They have attracted increasing attention owing to their excellent performance in robotic vision. Most recent works focus on designing curricula based on the difficulty level of the input samples or on smoothing the feature maps. However, smoothing labels to control the learning utility in a curriculum manner is still unexplored. In this work, we design a paced curriculum by label smoothing (P-CBLS) using paced learning with uniform label smoothing (ULS) for classification tasks, and fuse uniform and spatially varying label smoothing (SVLS) in a curriculum manner for semantic segmentation tasks. In ULS and SVLS, a larger smoothing factor imposes a heavier smoothing penalty on the true label and restricts the model to learning less information. We therefore design curriculum by label smoothing (CBLS): we set a larger smoothing value at the beginning of training and gradually decrease it to zero, raising the model's learning utility from lower to higher. We also design a confidence-aware pacing function and combine it with CBLS to investigate the benefits of various curricula. The proposed techniques are validated on four robotic surgery datasets covering multi-class classification, multi-label classification, captioning, and segmentation tasks. We also investigate the robustness of our method by corrupting the validation data at different severity levels. Our extensive analysis shows that the proposed method improves prediction accuracy and robustness.
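The annealing idea behind CBLS can be sketched in a few lines: smooth the one-hot targets with a uniform label-smoothing factor that starts large and decays to zero over training. The sketch below is a minimal PyTorch illustration under assumed names and an assumed linear decay schedule (`eps0`, `cbls_epsilon`); the paper's actual pacing functions may differ.

```python
import torch
import torch.nn.functional as F

def uls_targets(labels: torch.Tensor, num_classes: int, eps: float) -> torch.Tensor:
    """Uniform label smoothing: move eps of the probability mass onto a uniform distribution."""
    one_hot = F.one_hot(labels, num_classes).float()
    return one_hot * (1.0 - eps) + eps / num_classes

def cbls_epsilon(epoch: int, total_epochs: int, eps0: float = 0.3) -> float:
    """Assumed linear curriculum: start at eps0 (heavy smoothing) and decay to 0 by the last epoch."""
    return eps0 * max(0.0, 1.0 - epoch / total_epochs)

def cbls_loss(logits: torch.Tensor, labels: torch.Tensor, epoch: int, total_epochs: int) -> torch.Tensor:
    """Cross-entropy against the epoch-dependent smoothed targets."""
    targets = uls_targets(labels, logits.size(-1), cbls_epsilon(epoch, total_epochs))
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```

Early epochs thus see soft targets that cap how much the model can fit to each label, and the constraint is gradually relaxed as training proceeds.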
Purpose: Surgical scene understanding with tool-tissue interaction recognition and automatic report generation can play an important role in intra-operative guidance, decision-making, and post-operative analysis in robotic surgery. However, domain shifts between different surgeries, arising from inter- and intra-patient variation and the appearance of novel instruments, degrade model prediction performance. Moreover, it requires outputs from multiple models, which can be computationally expensive and affect real-time performance. Methodology: A multi-task learning (MTL) model is proposed for surgical report generation and tool-tissue interaction prediction that deals with these domain-shift problems. The model consists of a shared feature extractor, a mesh-transformer branch for captioning, and a graph-attention branch for tool-tissue interaction prediction. The shared feature extractor employs class-incremental contrastive learning (CICL) to tackle intensity shift and novel class appearance in the target domain. We incorporate Laplacian of Gaussian (LoG) based curriculum learning into both the shared and task-specific branches to enhance model learning. We adopt a task-aware asynchronous MTL optimization technique to fine-tune the shared weights and converge both tasks optimally. Results: The proposed MTL model trained using task-aware optimization and fine-tuning techniques reported a balanced performance (BLEU score of 0.4049 for scene captioning and accuracy of 0.3508 for interaction detection) for both tasks on the target domain and performed on par with single-task models in domain adaptation. Conclusion: The proposed multi-task model was able to adapt to domain shifts, incorporate novel instruments in the target domain, and perform tool-tissue interaction detection and report generation on par with single-task models.
Deep convolutional neural networks perform well on a variety of computer vision tasks, but they are prone to picking up spurious correlations from the training signal. Such "shortcuts" can arise during learning, for example, when specific frequencies present in the image data correlate with the output predictions. Both high and low frequencies can be characteristic of the underlying noise distribution introduced by image acquisition rather than task-relevant information about the image content. Models that learn features correlated with this characteristic noise do not generalize well to new data. In this work, we propose a simple yet effective training strategy, Frequency Dropout, to prevent convolutional neural networks from learning frequency-specific imaging features. We apply randomized filtering of feature maps during training, which acts as a feature-level regularization. In this study, we consider common image-processing filters such as Gaussian smoothing, Laplacian of Gaussian, and Gabor filtering. Our training strategy is model-agnostic and can be used for any computer vision task. We demonstrate the effectiveness of Frequency Dropout on a range of popular architectures and multiple tasks using computer vision and medical imaging datasets. Our results suggest that the proposed method not only improves prediction accuracy but also robustness against domain shift.
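As a rough illustration of the feature-map filtering described above, the module below randomly applies Gaussian smoothing to intermediate feature maps with some probability during training. The class name, probability, and sigma range are assumptions made for this sketch; the paper additionally considers Laplacian-of-Gaussian and Gabor filters, and its exact sampling scheme may differ.

```python
import random
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

class FrequencyDropout(nn.Module):
    """Randomly low-pass filters feature maps during training (simplified, Gaussian-only sketch)."""
    def __init__(self, p: float = 0.5, kernel_size: int = 3, sigma_range=(0.5, 2.0)):
        super().__init__()
        self.p, self.kernel_size, self.sigma_range = p, kernel_size, sigma_range

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and random.random() < self.p:
            sigma = random.uniform(*self.sigma_range)
            # Gaussian smoothing suppresses high-frequency content of the (N, C, H, W) feature map.
            x = TF.gaussian_blur(x, kernel_size=self.kernel_size, sigma=sigma)
        return x
```

The module can be dropped between convolutional blocks, e.g. `nn.Sequential(conv_block, FrequencyDropout())`, and is a no-op at evaluation time.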
Machine learning models are often deployed in test settings that differ from the training setting, potentially leading to degraded model performance due to domain shift. If we could estimate the performance a pre-trained model would achieve in a specific deployment setting, such as a certain clinic, we could judge whether the model can be deployed safely or whether its performance on that specific data is unacceptable. Existing approaches estimate this based on the prediction confidence on unlabeled test data from the deployment domain. We find that existing methods struggle with data that exhibit class imbalance, because the methods used to calibrate confidence do not account for the bias induced by class imbalance and consequently fail to estimate class-wise accuracy. Here, we introduce class-wise calibration within a performance-estimation framework for imbalanced datasets. Specifically, we derive class-specific modifications of state-of-the-art confidence-based model-evaluation methods, including temperature scaling (TS), difference of confidences (DOC), and average thresholded confidence (ATC). We also extend the methods to estimate the Dice similarity coefficient (DSC) in image segmentation. We conduct experiments on four tasks and find that the proposed modifications consistently improve estimation accuracy across datasets, improving it by 18% for classification under natural domain shift compared with prior methods.
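To make the class-wise idea concrete, here is a rough numpy sketch of a class-wise variant of average thresholded confidence (ATC): a confidence threshold is fitted per predicted class on labeled source data so that the fraction of samples above it matches that class's source accuracy, and the per-class accuracy on unlabeled target data is then estimated as the fraction of target predictions exceeding the class threshold. Function names and details are assumptions; the paper's exact class-specific formulations (also for TS and DOC) may differ.

```python
import numpy as np

def classwise_atc_thresholds(src_conf, src_pred, src_label, num_classes):
    """Fit one confidence threshold per class on labeled source/validation data."""
    thresholds = np.zeros(num_classes)
    for c in range(num_classes):
        mask = src_pred == c
        acc_c = (src_label[mask] == c).mean()                      # source accuracy for class c
        thresholds[c] = np.quantile(src_conf[mask], 1.0 - acc_c)   # so that P(conf > t) = acc_c
    return thresholds

def classwise_atc_estimate(tgt_conf, tgt_pred, thresholds, num_classes):
    """Estimated per-class accuracy on unlabeled target data."""
    return np.array([(tgt_conf[tgt_pred == c] > thresholds[c]).mean()
                     for c in range(num_classes)])
```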
Curriculum learning requires example difficulty to progress from easy to hard. However, the credibility of image difficulty is rarely studied, which can seriously affect the effectiveness of a curriculum. In this work, we propose Angular Gap, a difficulty measure based on the angular difference between feature embeddings and class-weight embeddings constructed via hyperspherical learning. To obtain reliable difficulty estimates, we introduce class-wise model calibration, as a post-training technique, into the learned hyperspherical space. This bridges the gap between probabilistic model calibration and angular-distance estimation in hyperspherical learning. We show the superiority of the calibrated Angular Gap over recent difficulty metrics on CIFAR10-H and ImageNetV2. We further propose Angular Gap based curriculum learning for unsupervised domain adaptation, which can transition from learning easy samples to mining hard samples. We combine this curriculum with a state-of-the-art self-training method (CST). The proposed Curriculum CST learns robust representations and outperforms recent baselines on Office31 and VisDA 2017.
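An illustrative reconstruction of the angular-difficulty idea under assumed conventions: with L2-normalized features and class weights, as in hyperspherical learning, a sample can be scored by the angle to its true-class weight relative to the nearest competing class, so larger values indicate harder samples. This sketch is not the paper's exact definition or its calibration procedure.

```python
import torch
import torch.nn.functional as F

def angular_gap(features: torch.Tensor, class_weights: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Per-sample difficulty: angle to the true class minus angle to the closest other class."""
    f = F.normalize(features, dim=1)        # (N, D) unit-norm feature embeddings
    w = F.normalize(class_weights, dim=1)   # (C, D) unit-norm class-weight embeddings
    angles = torch.acos((f @ w.t()).clamp(-1 + 1e-7, 1 - 1e-7))   # (N, C) angular distances
    true_angle = angles.gather(1, labels.view(-1, 1)).squeeze(1)
    nearest_other = angles.scatter(1, labels.view(-1, 1), float("inf")).min(dim=1).values
    return true_angle - nearest_other
```

Sorting samples by this score in ascending order would give an easy-to-hard ordering for the curriculum.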
Surgical captioning plays an important role in surgical instruction prediction and report generation. However, most captioning models still rely on computationally heavy object detectors or feature extractors to extract regional features. Moreover, detection models require additional bounding-box annotations, which are expensive and need skilled annotators. This leads to inference latency and limits the deployment of captioning models in real-time robotic surgery. To this end, we design an end-to-end detector-free and feature-extractor-free captioning model by leveraging a patch-based shifted-window technique. We propose the Shifted Window-based Multi-Layer Perceptron Transformer Captioning model (SwinMLP-TranCAP) with faster inference speed and less computation. SwinMLP-TranCAP replaces the multi-head attention module with window-based multi-head MLPs. Such designs have mainly been explored for image-understanding tasks, and few works study the caption-generation task. SwinMLP-TranCAP is also extended to a video version for video captioning tasks using 3D patches and windows. Compared with previous detector-based or feature-extractor-based models, our model greatly simplifies the architecture design while maintaining performance on two surgical datasets. The code is publicly available at https://github.com/xumengyaamy/swinmlp_trancap.
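A simplified sketch of the window-based MLP token mixer that stands in for windowed self-attention: the feature map is partitioned into non-overlapping local windows and an MLP mixes the tokens inside each window. This single-head, shift-free version uses assumed names and sizes; SwinMLP-TranCAP's actual block (multi-head, shifted windows) is more elaborate.

```python
import torch
import torch.nn as nn

class WindowMLP(nn.Module):
    """Token mixing inside local windows via an MLP instead of multi-head self-attention."""
    def __init__(self, dim: int, window_size: int = 7):
        super().__init__()
        self.ws = window_size
        tokens = window_size * window_size
        self.norm = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(nn.Linear(tokens, tokens), nn.GELU(), nn.Linear(tokens, tokens))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) with H and W divisible by the window size.
        B, H, W, C = x.shape
        ws, shortcut = self.ws, x
        x = self.norm(x)
        # Partition into non-overlapping ws x ws windows and mix the ws*ws tokens of each window.
        x = x.view(B, H // ws, ws, W // ws, ws, C).permute(0, 1, 3, 5, 2, 4).reshape(B, -1, C, ws * ws)
        x = self.token_mlp(x)
        # Undo the window partition and add the residual connection.
        x = x.reshape(B, H // ws, W // ws, C, ws, ws).permute(0, 1, 4, 2, 5, 3).reshape(B, H, W, C)
        return shortcut + x
```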
Data diversity and volume are crucial to the success of training deep learning models, while in the medical imaging field the difficulty and cost of data collection and annotation are especially high. Specifically in robotic surgery, data scarcity and imbalance severely affect model accuracy and limit the design and deployment of deep-learning-based surgical applications such as surgical instrument segmentation. With this in mind, in this paper we rethink the surgical instrument segmentation task and propose a one-to-many data generation solution that frees us from the complicated and expensive data collection and annotation process of robotic surgery. In our approach, we utilize only a single surgical background tissue image and a few open-source instrument images as seed images and apply multiple augmentation and blending techniques to synthesize a large number of image variations. In addition, we introduce chained augmentation mixing during training to further enhance data diversity. The proposed approach is evaluated on the real datasets of the EndoVis-2018 and EndoVis-2017 surgical scene segmentation tasks. Our empirical analysis suggests that, without the high cost of data collection and annotation, we can achieve decent surgical instrument segmentation performance. Moreover, we observe that our method can handle novel instrument prediction in the deployment domain. We hope our inspiring results encourage researchers to emphasize data-centric methods in order to overcome deep learning limitations beyond data shortage, such as class imbalance, domain adaptation, and incremental learning.
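A rough sketch of the seed-image synthesis idea: paste an (already augmented) instrument cut-out onto an (already augmented) background tissue image at a random position and reuse its binary mask as the segmentation label. The function and its plain copy-paste compositing are assumptions for illustration; the paper's augmentation, blending, and chained-mixing pipeline is considerably richer.

```python
import random
import numpy as np

def synthesize_sample(background: np.ndarray, instrument: np.ndarray, inst_mask: np.ndarray):
    """Blend one instrument cut-out onto a background tissue image; return (image, segmentation mask).
    background: (H, W, 3) uint8; instrument: (h, w, 3) uint8; inst_mask: (h, w) with h <= H, w <= W."""
    image = background.copy()
    mask = np.zeros(background.shape[:2], dtype=np.uint8)
    h, w = inst_mask.shape
    # Random placement of the instrument patch (plain copy-paste; richer blending is possible).
    y = random.randint(0, background.shape[0] - h)
    x = random.randint(0, background.shape[1] - w)
    fg = inst_mask > 0
    image[y:y + h, x:x + w][fg] = instrument[fg]
    mask[y:y + h, x:x + w][fg] = 1
    return image, mask
```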
Visual question answering (VQA) in surgery is largely unexplored. Expert surgeons are scarce and often overloaded with clinical and academic workloads. This overload often limits their time to answer questionnaires from patients, medical students, or junior residents related to surgical procedures. At times, students and junior residents also refrain from asking too many questions during classes to reduce disruption. While computer-aided simulators and recordings of past surgical procedures already allow them to observe and improve their skills, they still heavily rely on medical experts to answer their questions. A surgical VQA system serving as a reliable "second opinion" could act as a backup and ease the burden on medical experts in answering these questions. The lack of annotated medical data and the presence of domain-specific terms have limited the exploration of VQA for surgical procedures. In this work, we design a Surgical-VQA task that answers questionnaires about surgical procedures based on the surgical scene. Extending the MICCAI Endoscopic Vision Challenge 2018 dataset and a workflow recognition dataset, we introduce two Surgical-VQA datasets with classification- and sentence-based answers. To perform Surgical-VQA, we employ vision-text transformer models. We further introduce a residual-MLP-based VisualBERT encoder model that enforces interaction between visual and text tokens, improving performance for classification-based answers. Furthermore, we study the effect of the number of input image patches and of temporal visual features on model performance for both classification- and sentence-based answers.
Context-aware decision support in the operating room can foster surgical safety and efficiency by leveraging real-time feedback from surgical workflow analysis. Most existing works recognize surgical activities at a coarse-grained level, such as phases, steps or events, leaving out fine-grained interaction details about the surgical activity; yet those are needed for more helpful AI assistance in the operating room. Recognizing surgical actions as <instrument, verb, target> triplets delivers comprehensive details about the activities taking place in surgical videos. This paper presents CholecTriplet2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos. The challenge granted private access to the large-scale CholecT50 dataset, which is annotated with action triplet information. In this paper, we present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge. A total of 4 baseline methods from the challenge organizers and 19 new deep learning algorithms by competing teams are presented to recognize surgical action triplets directly from surgical videos, achieving mean average precision (mAP) ranging from 4.2% to 38.1%. This study also analyzes the significance of the results obtained by the presented approaches, performs a thorough methodological comparison between them and an in-depth result analysis, and proposes a novel ensemble method for enhanced recognition. Our analysis shows that surgical workflow analysis is not yet solved, and also highlights interesting directions for future research on fine-grained surgical activity recognition, which is of utmost importance for the development of AI in surgery.
The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging as it requires the ability to reason over very long movie segments. This is in contrast to most existing video recognition models, which are typically designed for short-range video analysis. This work proposes a State-Space Transformer model that can efficiently capture dependencies in long movie videos for accurate movie scene detection. Our model, dubbed TranS4mer, is built using a novel S4A building block, which combines the strengths of structured state-space sequence (S4) and self-attention (A) layers. Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies. Afterward, the state-space operation in the S4A block is used to aggregate long-range inter-shot cues. The final TranS4mer model, which can be trained end-to-end, is obtained by stacking the S4A blocks one after the other multiple times. Our proposed TranS4mer outperforms all prior methods in three movie scene detection datasets, including MovieNet, BBC, and OVSD, while also being $2\times$ faster and requiring $3\times$ less GPU memory than standard Transformer models. We will release our code and models.
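To illustrate the block layout described above, here is a heavily simplified sketch: self-attention first mixes the frames within each shot, then a second operation mixes information across the whole frame sequence. A depthwise 1D convolution is used below purely as a stand-in for the S4 state-space layer, which is substantially more involved; names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class S4ABlockSketch(nn.Module):
    """Intra-shot self-attention followed by a long-range sequence op (conv stand-in for S4)."""
    def __init__(self, dim: int = 256, num_heads: int = 4, kernel_size: int = 9):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.seq = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_shots, frames_per_shot, dim)
        b, s, f, d = x.shape
        # 1) Short-range intra-shot dependencies: attention over the frames of each shot.
        h = self.norm1(x).reshape(b * s, f, d)
        h, _ = self.attn(h, h, h, need_weights=False)
        x = x + h.reshape(b, s, f, d)
        # 2) Long-range inter-shot cues: a sequence operation over the full flattened frame sequence.
        h = self.norm2(x).reshape(b, s * f, d).transpose(1, 2)   # (b, dim, s*f)
        x = x + self.seq(h).transpose(1, 2).reshape(b, s, f, d)
        return x
```

Stacking such blocks end-to-end mirrors the described composition of short-range and long-range mixing, though the real model's S4 layer gives it the reported speed and memory advantages.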